Main Protocol
(1) Construct a co-expression network
Here, we apply weighted gene coexpression network analysis to the expression matrix. This section heavily represents the content from the WGCNA tutorials created by the author's of WGCNA, found here: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/
Setting WGCNA options
options(stringsAsFactors = FALSE)
Quality Control of Expression Data
In this section, the samples in the expression matrix are clustered and plotted to identify outliers.
#Using a cluster tree to find sample outliers
sampleTree = hclust(dist(norm_exp_mat), method = "average")
# sizeGrWindow(12,9)
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,
cex.axis = 1.5, cex.main = 2)
Power Calculation for Network Construction
Next, the power used for calculating the network is empirically chosen by calculating scale-free topology and mean connectivity.
powers = c(c(1:10), seq(from = 12, to=20, by=2))
# Call the network topology analysis function
sft = pickSoftThreshold(norm_exp_mat, powerVector = powers, verbose = 5, networkType = "signed")
## pickSoftThreshold: will use block size 1529.
## pickSoftThreshold: calculating connectivity for given powers...
## ..working on genes 1 through 1529 of 29255
## Warning: executing %dopar% sequentially: no parallel backend registered
## ..working on genes 1530 through 3058 of 29255
## ..working on genes 3059 through 4587 of 29255
## ..working on genes 4588 through 6116 of 29255
## ..working on genes 6117 through 7645 of 29255
## ..working on genes 7646 through 9174 of 29255
## ..working on genes 9175 through 10703 of 29255
## ..working on genes 10704 through 12232 of 29255
## ..working on genes 12233 through 13761 of 29255
## ..working on genes 13762 through 15290 of 29255
## ..working on genes 15291 through 16819 of 29255
## ..working on genes 16820 through 18348 of 29255
## ..working on genes 18349 through 19877 of 29255
## ..working on genes 19878 through 21406 of 29255
## ..working on genes 21407 through 22935 of 29255
## ..working on genes 22936 through 24464 of 29255
## ..working on genes 24465 through 25993 of 29255
## ..working on genes 25994 through 27522 of 29255
## ..working on genes 27523 through 29051 of 29255
## ..working on genes 29052 through 29255 of 29255
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.0985 -74.30 0.978 14600.00 14600.00 14900.0
## 2 2 0.3610 -59.50 0.873 7540.00 7530.00 7900.0
## 3 3 0.6250 -41.80 0.890 3990.00 3980.00 4390.0
## 4 4 0.6770 -27.10 0.901 2170.00 2160.00 2550.0
## 5 5 0.7300 -19.30 0.925 1210.00 1200.00 1540.0
## 6 6 0.7800 -14.40 0.945 687.00 678.00 967.0
## 7 7 0.8490 -10.60 0.938 399.00 391.00 632.0
## 8 8 0.9160 -8.79 0.954 237.00 230.00 448.0
## 9 9 0.9600 -7.29 0.968 143.00 138.00 333.0
## 10 10 0.9810 -6.03 0.977 88.20 83.60 258.0
## 11 12 0.9920 -4.38 0.994 35.30 32.30 174.0
## 12 14 0.9910 -3.37 0.994 15.20 13.10 131.0
## 13 16 0.9900 -2.75 0.989 7.06 5.61 105.0
## 14 18 0.9770 -2.36 0.970 3.53 2.50 87.2
## 15 20 0.9710 -2.08 0.964 1.91 1.16 74.1
# Plot the results:
# sizeGrWindow(9, 5)
# par(mfrow = c(1,2));
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence")); text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red");abline(h=0.9,col="red")
# this line corresponds to using an R^2 cut-off of h
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"));text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
#### power = 8
#### lowest power to provide a scale-free topology fit of at least 0.9
#### Low mean connectivity score
Calculating the Network
Using the lower power that results in a scale-free topology fit of at least 0.9, and thus a low mean connectivity score, the network is calculated.
net = blockwiseModules(norm_exp_mat, power = 8,
TOMType = "signed", minModuleSize = 20,
networkType = "signed",
reassignThreshold = 0, mergeCutHeight = 0.15,
numericLabels = TRUE, pamRespectsDendro = FALSE,
saveTOMs = TRUE,
saveTOMFileBase = "./data/network/norm_exp_mat_TOM",
verbose = 3)
## Calculating module eigengenes block-wise from all genes
## Flagging genes and samples with too many missing values...
## ..step 1
## ....pre-clustering genes to determine blocks..
## Projective K-means:
## ..k-means clustering..
## ..merging smaller clusters...
## Block sizes:
## gBlocks
## 1 2 3 4 5 6
## 4996 4980 4964 4880 4739 4696
## ..Working on block 1 .
## TOM calculation: adjacency..
## ..will not use multithreading.
## Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
## ..saving TOM for block 1 into file ./data/network/norm_exp_mat_TOM-block.1.RData
## ....clustering..
## ....detecting modules..
## ....calculating module eigengenes..
## ....checking kME in modules..
## ..removing 280 genes from module 1 because their KME is too low.
## ..removing 318 genes from module 2 because their KME is too low.
## ..removing 77 genes from module 3 because their KME is too low.
## ..removing 499 genes from module 4 because their KME is too low.
## ..removing 168 genes from module 5 because their KME is too low.
## ..removing 121 genes from module 6 because their KME is too low.
## ..removing 38 genes from module 7 because their KME is too low.
## ..removing 2 genes from module 8 because their KME is too low.
## ..removing 2 genes from module 9 because their KME is too low.
## ..Working on block 2 .
## TOM calculation: adjacency..
## ..will not use multithreading.
## Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
## ..saving TOM for block 2 into file ./data/network/norm_exp_mat_TOM-block.2.RData
## ....clustering..
## ....detecting modules..
## ....calculating module eigengenes..
## ....checking kME in modules..
## ..removing 530 genes from module 1 because their KME is too low.
## ..removing 654 genes from module 2 because their KME is too low.
## ..removing 96 genes from module 3 because their KME is too low.
## ..removing 188 genes from module 4 because their KME is too low.
## ..removing 21 genes from module 5 because their KME is too low.
## ..Working on block 3 .
## TOM calculation: adjacency..
## ..will not use multithreading.
## Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
## ..saving TOM for block 3 into file ./data/network/norm_exp_mat_TOM-block.3.RData
## ....clustering..
## ....detecting modules..
## ....calculating module eigengenes..
## ....checking kME in modules..
## ..removing 1156 genes from module 1 because their KME is too low.
## ..removing 1071 genes from module 2 because their KME is too low.
## ..Working on block 4 .
## TOM calculation: adjacency..
## ..will not use multithreading.
## Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
## ..saving TOM for block 4 into file ./data/network/norm_exp_mat_TOM-block.4.RData
## ....clustering..
## ....detecting modules..
## ....calculating module eigengenes..
## ....checking kME in modules..
## ..removing 562 genes from module 1 because their KME is too low.
## ..removing 360 genes from module 2 because their KME is too low.
## ..removing 264 genes from module 3 because their KME is too low.
## ..removing 215 genes from module 4 because their KME is too low.
## ..removing 151 genes from module 5 because their KME is too low.
## ..removing 79 genes from module 6 because their KME is too low.
## ..removing 22 genes from module 7 because their KME is too low.
## ..removing 6 genes from module 8 because their KME is too low.
## ..removing 3 genes from module 9 because their KME is too low.
## ..Working on block 5 .
## TOM calculation: adjacency..
## ..will not use multithreading.
## Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
## ..saving TOM for block 5 into file ./data/network/norm_exp_mat_TOM-block.5.RData
## ....clustering..
## ....detecting modules..
## ....calculating module eigengenes..
## ....checking kME in modules..
## ..removing 657 genes from module 1 because their KME is too low.
## ..removing 269 genes from module 2 because their KME is too low.
## ..removing 119 genes from module 3 because their KME is too low.
## ..removing 130 genes from module 4 because their KME is too low.
## ..removing 126 genes from module 5 because their KME is too low.
## ..removing 64 genes from module 6 because their KME is too low.
## ..removing 53 genes from module 7 because their KME is too low.
## ..removing 36 genes from module 8 because their KME is too low.
## ..removing 18 genes from module 9 because their KME is too low.
## ..removing 33 genes from module 10 because their KME is too low.
## ..removing 35 genes from module 11 because their KME is too low.
## ..removing 40 genes from module 12 because their KME is too low.
## ..removing 13 genes from module 13 because their KME is too low.
## ..removing 26 genes from module 14 because their KME is too low.
## ..removing 6 genes from module 15 because their KME is too low.
## ..removing 5 genes from module 16 because their KME is too low.
## ..removing 5 genes from module 17 because their KME is too low.
## ..removing 3 genes from module 18 because their KME is too low.
## ..removing 6 genes from module 19 because their KME is too low.
## ..removing 3 genes from module 20 because their KME is too low.
## ..removing 2 genes from module 21 because their KME is too low.
## ..Working on block 6 .
## TOM calculation: adjacency..
## ..will not use multithreading.
## Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
## ..saving TOM for block 6 into file ./data/network/norm_exp_mat_TOM-block.6.RData
## ....clustering..
## ....detecting modules..
## ....calculating module eigengenes..
## ....checking kME in modules..
## ..removing 1087 genes from module 1 because their KME is too low.
## ..removing 681 genes from module 2 because their KME is too low.
## ..removing 158 genes from module 3 because their KME is too low.
## ..removing 126 genes from module 4 because their KME is too low.
## ..removing 58 genes from module 5 because their KME is too low.
## ..removing 10 genes from module 6 because their KME is too low.
## ..removing 7 genes from module 7 because their KME is too low.
## ..merging modules that are too close..
## mergeCloseModules: Merging modules whose distance is less than 0.15
## Calculating new MEs...
table(net$colors)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 10659 1463 1274 1077 894 806 788 751 708 690 661 635 631
## 13 14 15 16 17 18 19 20 21 22 23 24 25
## 527 521 488 469 438 432 406 372 297 287 282 276 270
## 26 27 28 29 30 31 32 33 34 35 36 37 38
## 245 231 218 213 188 143 132 124 120 116 97 97 90
## 39 40 41 42 43 44 45 46 47 48 49 50 51
## 89 83 80 77 76 60 60 58 56 55 54 54 48
## 52 53 54 55 56 57 58
## 47 47 45 44 40 37 29
Saving the Network
The components of the network required for downstream analyses are saved so the network does not need to be calculated multiple times.
moduleLabels = net$colors
moduleColors = labels2colors(net$colors)
MEs = net$MEs;
dim(MEs)
## [1] 42 59
geneTree = net$dendrograms[[1]];
save(sft, MEs, moduleLabels, moduleColors, geneTree,
file = "./data/network/network.RData")
Loading the Network
In order to return to this script and run downstream analyses without recalculating the network, the necessary artifacts can be loaded using this code.
load("./data/network/network.RData")
Annotating the Network
In order to interpret the network it is useful to label the genes in the network with identifiers, e.g. Ensembl gene IDs, entrez IDs, etc. Using the biomaRt package, mappings from those identifiers to gene symbols can be acquired. These steps will vary based on the organism you use and the gene identifiers used in the list of phenotype and disease associated genes generated above.
# establish the database that matches your organism
ensembl = useMart("ensembl")
mm19 = useDataset(mart = ensembl, dataset = "mmusculus_gene_ensembl")
# These functions may be used to determine the attributes and filters to apply in the database query
# listAttributes(mm19)
# listFilters(mm19)
# Next, query the BioMart database for annotations for the genes of interest and merge the output of the query with the network results
m = getBM(attributes = c("ensembl_transcript_id", "external_gene_name", "description", "chromosome_name", "start_position", "end_position"),
filters = c("ensembl_transcript_id"),
values = colnames(norm_exp_mat),
mart = mm19)
gene_MEs <- rbind(colnames(norm_exp_mat), moduleColors, moduleLabels)
gene_MEs <- t(gene_MEs)
colnames(gene_MEs)[1:3] <- c("ensembl_transcript_id", "mod_color", "mod_number")
gene_MEs <- as.data.frame(gene_MEs)
annotated_mods = merge(gene_MEs, m, by = "ensembl_transcript_id")
# now you can filter annotated_mods to see the genes in each co-expression module
annotated_mods %>%
filter(mod_number == 1)
(2) Gene Ontology Analysis
In this section, we describe the process of identifying gene ontology processes that are enriched in each coexpression module, and look at an example results table.
While tools exist for performing gene ontology analysis in R, our tool of preference for this project is ToppFun, of the ToppGene suite: https://toppgene.cchmc.org/enrichment.jsp, which is a web interface tool. ToppGene is unique in that it not only searches for enriched gene ontology categories and pathways for your genes of interest, but also human and mouse phenotypes, publications and published coexpression data sets, gene families, microRNAs, drugs, and diseases.
ToppFun reports back the significance of each enrichment with an assortment of multiple testing correction methods, the number of hits for that category from your query ("Hit Count in Query List"), and the total number of genes in the category ("Hit Count in Genome").
# You can extract the gene names for a module by writing the module out to a .csv file which can be fed into external tools, like ToppGene
annotated_mods %>%
filter(mod_number == 1) %>%
write_csv(., "./data/network/mod_1.csv")
# From ToppFun, you can click the "Download All" button from your results page, and get all of the results, so you can save them, browse them in R, and compare different modules in R
url = "https://www.cell.com/cms/10.1016/j.celrep.2020.108145/attachment/60cdba8a-aa04-4c22-8ca4-4f8f0e6173dd/mmc3.xlsx"
download.file(url = url, destfile = "./data/gene_ontology/toppfun_res.xlsx")
toppfun_1 = read_excel("./data/gene_ontology/toppfun_res.xlsx", sheet = 1, skip = 1)
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in J2086 / R2086C10: got 'NA'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in J2211 / R2211C10: got 'NA'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3600 / R3600C3: got 'Color representing the co-expression
## module used for GO analysis'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3601 / R3601C3: got 'Number of genes in each co-expression
## module'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3602 / R3602C3: got 'Category of query GO term'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3603 / R3603C3: got 'GO ID'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3604 / R3604C3: got 'Query GO term'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3605 / R3605C3: got 'P-value of enrichment'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3606 / R3606C3: got 'Bonferroni corrected p-value'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3607 / R3607C3: got 'Benjamini-Hochberg corrected p-value'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3608 / R3608C3: got 'Benjamini–Yekutieli corrected p-
## value'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3609 / R3609C3: got 'Number of module genes in GO category
## list'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3610 / R3610C3: got 'Number of gene in the GO cateogry'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3611 / R3611C3: got 'Genes from query list overlapping
## hits'
toppfun_1
(3) Creating GWAS gene list
Next, a list of genes implicated by the GWAS we loaded above will be generated. In the case of the GWAS we used, fine mapping was performed to identify a subset of significantly associated SNPs that are indepedent associations and lead SNPs, so here, we'll read in the lead SNP file and use it to subset the full summary statistics for the study.
Generating GWAS regions (SNPs with LD > 0.7 for GWAS SNPs)
## Reading in lead SNPs
url = "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5621629/bin/NIHMS73769-supplement-Supplementary_Table_2.xlsx"
download.file(url = url, destfile = "./data/gwas/gwas_bmd_lead_snps.xlsx")
lead_snps = read_excel("./data/gwas/gwas_bmd_lead_snps.xlsx", sheet = 1, skip = 3)
bmd_snps <- subset(gwas_sumstat, gwas_sumstat$SNP %in% lead_snps$RSID)
# Some of the variants are not SNPs, but indels. LDLink recognizes them by a different naming convention, so we have to modify their identifiers for LDLink to find their proxies
bmd_snps = bmd_snps %>%
mutate(SNP = word(bmd_snps$SNP,1,sep = "_"))
bmd_snps$SNP[grep(":", bmd_snps$SNP)]
## [1] "1:22483649" "10:29087203" "17:27961561" "22:19677948"
bmd_snps$SNP[6] = paste0("chr", bmd_snps$SNP[6])
bmd_snps$SNP[163] = paste0("chr", bmd_snps$SNP[163])
bmd_snps$SNP[260] = paste0("chr", bmd_snps$SNP[260])
bmd_snps$SNP[301] = paste0("chr", bmd_snps$SNP[301])
# Next we batch query LDLink to get all of the proxy SNPs
LDproxy_batch(snp = bmd_snps$SNP, pop = "EUR", r2d = "r2", token = "c0f613f149ab", append = TRUE)
## error: rs191147097 is monoallelic in the EUR population.,
## error: rs149333699 is monoallelic in the EUR population.,
## error: rs184953495 is monoallelic in the EUR population.,
## error: rs10668066 is not a biallelic variant.,
## error: rs143581991 is not in 1000G reference panel.,
## error: rs746652558 is not in 1000G reference panel.,
## error: rs12806687 is not in 1000G reference panel.,
## error: rs72186592 is not in dbSNP build 151.,
## error: rs202234992 is not in 1000G reference panel.,
# some SNPs will return an error in the LD search. We will stil annotate the nearest up and downstream genes from these SNPs, but their window will just be their coordinate location
no_ld_snps = c("rs191147097", "rs149333699", "rs184953495", "rs10668066",
"rs143581991", "chr10:29087203", "rs12806687", "rs72186592", "rs202234992")
# we format these so we can combine them with the LD 0.7 regions below
no_ld_regions = bmd_snps %>%
filter(SNP %in% no_ld_snps) %>%
dplyr::rename(query_snp = SNP) %>%
group_by(query_snp) %>%
summarize(min = min(BP), max = max(BP), chr = paste0("chr",max(CHR)))
# We read back in the results of the LDLink proxy query
proxies = read_tsv("./combined_query_snp_list.txt", skip = 1, col_names = FALSE)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## X1 = col_double(),
## X2 = col_character(),
## X3 = col_character(),
## X4 = col_character(),
## X5 = col_character(),
## X6 = col_double(),
## X7 = col_double(),
## X8 = col_double(),
## X9 = col_double(),
## X10 = col_character(),
## X11 = col_character(),
## X12 = col_character()
## )
colnames(proxies) = c("index", "query_snp", "rs_id", "coord", "alleles", "maf", "distance", "d_prime", "r2", "correlated_alleles", "regulome_db", "function")
# now we filter for proxies with R2 greater than 0.7, and for each query_snp we will create an entry in our regions table with the max and min distance from the
ld_regions = proxies %>%
filter(r2 >= 0.7) %>%
separate(col = coord, sep = ":", into = c("chr", "coord")) %>%
group_by(query_snp) %>%
summarize(chr = max(chr), min = as.numeric(min(coord)), max = as.numeric(max(coord)))
# then we combine both sets to get one set of regions of LD >= 0.7 for each GWAS lead SNP
ld07_gwas_regions = bind_rows(no_ld_regions, ld_regions) %>%
dplyr::select(chr, start = min, end = max, query_snp)
# make sure none of the regions have a negative width
for(i in 1:nrow(ld07_gwas_regions)){
if((ld07_gwas_regions[i,3]-ld07_gwas_regions[i,2]) < 0){
print(paste0("swap row #", i))
start = ld07_gwas_regions[i,3]
end = ld07_gwas_regions[i,2]
ld07_gwas_regions[i,3] = end
ld07_gwas_regions[i,2] = start
}
}
## [1] "swap row #46"
Intersection GWAS regions with reference gene file
Next, we want to find the intersection between these regions and the full human geneset. Our gene set was generated from the hg19 reference genome using the UCSC gene tables. It is important to match the coordinates of the genome used in the GWAS study to the gene set
## Reading in UCSC RefSeq gene table ##
url = "https://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz"
genes = read_tsv(url, col_names = FALSE)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## X1 = col_double(),
## X2 = col_character(),
## X3 = col_character(),
## X4 = col_character(),
## X5 = col_double(),
## X6 = col_double(),
## X7 = col_double(),
## X8 = col_double(),
## X9 = col_double(),
## X10 = col_number(),
## X11 = col_number(),
## X12 = col_double(),
## X13 = col_character(),
## X14 = col_character(),
## X15 = col_character(),
## X16 = col_character()
## )
colnames(genes) = c("bins", "refseq_id", "chr", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd", "exonCount", "ExonStarts", "ExonEnds", "score", "name", "cdsStartStat", "cdsEndStat", "exonFrames")
genes = genes[,c(3,5,6,2,13)]
genes = genes %>%
filter(str_detect(refseq_id, "NM_"))
# check that the column types are correct and chromosome notation matches
head(genes)
head(ld07_gwas_regions)
# Next, we're going to use the Genomic Ranges package to annotate our ranges with genes
gr_genes <- GRanges(genes)
gr_gwas <- GRanges(ld07_gwas_regions)
# We find the genes that overlap the ranges from the GWAS study
overlaps <- GenomicRanges::findOverlaps(gr_genes, gr_gwas)
overlaps_info = cbind(ld07_gwas_regions[overlaps@to,], genes[overlaps@from,]$name)
overlaps_info = overlaps_info[,c(1,4,5)]
overlaps_info$type = "overlaps"
overlaps_info = overlaps_info[!duplicated(overlaps_info), ]
colnames(overlaps_info) = c("chromosome", "gwas_snp", "gene_name", "type")
# We find the nearest genes preceding the ranges from the GWAS study
nearest_precede <- GenomicRanges::precede(gr_gwas, gr_genes)
precede_info = cbind(ld07_gwas_regions, genes[nearest_precede,])
precede_info = precede_info[,c(1,4,9)]
precede_info$type = "precede"
precede_info = precede_info[!duplicated(precede_info), ]
colnames(precede_info) = c("chromosome", "gwas_snp", "gene_name", "type")
precede_info = precede_info[-which(precede_info$morris_snp %in% overlaps_info$gwas_snp),]
# We find the nearest genes following the ranges from the GWAS study
nearest_follow <- GenomicRanges::follow(gr_gwas, gr_genes)
follow_info = cbind(ld07_gwas_regions, genes[nearest_follow,])
follow_info = follow_info[,c(1,4,9)]
follow_info$type = "follow"
follow_info = follow_info[!duplicated(follow_info), ]
colnames(follow_info) = c("chromosome", "gwas_snp", "gene_name", "type")
follow_info = follow_info[-which(follow_info$morris_snp %in% overlaps_info$gwas_snp),]
# Finally, we combine these results into one table
gwas_region_genes = rbind(overlaps_info, precede_info, follow_info)
gwas_region_genes
Identify mouse homologs for GWAS genes
Now, we now have a list of human genes that overlap the regions as defined by LD from our study of interest, however we may also need the equivalent list in mouse. We can use a homology map to generate a list of mouse homologs for the human gene list from MGI (http://www.informatics.jax.org/faq/ORTH_dload.shtml)
## Reading in homology map ##
url = "http://www.informatics.jax.org/downloads/reports/HOM_MouseHumanSequence.rpt"
homology <- read_tsv(url)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## `HomoloGene ID` = col_double(),
## `Common Organism Name` = col_character(),
## `NCBI Taxon ID` = col_double(),
## Symbol = col_character(),
## `EntrezGene ID` = col_double(),
## `Mouse MGI ID` = col_character(),
## `HGNC ID` = col_character(),
## `OMIM Gene ID` = col_character(),
## `Genetic Location` = col_character(),
## `Genomic Coordinates (mouse: , human: )` = col_character(),
## `Nucleotide RefSeq IDs` = col_character(),
## `Protein RefSeq IDs` = col_character(),
## `SWISS_PROT IDs` = col_character()
## )
homology
gwas_gene_hom <- homology %>%
filter(Symbol %in% gwas_region_genes$gene_name)
gwas_mouse_hom <- homology %>%
filter(`HomoloGene ID` %in% gwas_gene_hom$`HomoloGene ID`) %>%
filter(`NCBI Taxon ID` == 10090)
gwas_mouse_genes = gwas_mouse_hom$Symbol
(4) Module Enrichment
In this section, we conduct Fisher's Exact tests for module enrichment for GWAS genes, disease genes, and phenotype associated genes and generate tables and plots of the results.
Identify modules enriched for GWAS genes
# create list of all modules
colors = unique(moduleColors)
# create object for each module and compute enrichment of overlap between module and GWAS gene list
for (i in 1:length(colors)){
color = colors[i]
matrix_name = paste("mod_",color,"_gene_exp", sep = "")
assign(paste0("mod_",color,"_gene_exp"), subset(gene_MEs, gene_MEs$mod_color == color))
assign(paste0("mod_",color,"_gene_exp"), as.data.frame(get(paste0("mod_",color,"_gene_exp"))))
assign(paste0("mod_",color,"_gene_ids"), rownames(get(paste0("mod_",color,"_gene_exp"))))
assign(paste0("mod_",color,"_trx_info"), annotated_mods %>% filter(mod_color == color))
d = as.data.frame(get(paste0("mod_",color,"_trx_info")))
x = unique(d$external_gene_name) %in% gwas_mouse_genes
x = sum(x == TRUE)
a <- x
b <- (417 - a)
c <- (dim(d)[1] - a)
e <- (29255 - c - b - a)
assign(paste0("ft_mod_",color), fisher.test(matrix(c(a,b,c,e),2,2,byrow=TRUE),
alternative='greater'))
}
# format results in table
gwas_enrichment_results = as.data.frame(matrix(nrow=59,ncol=3))
colnames(gwas_enrichment_results) <- c('module_color','p_value','odds_ratio')
for (i in 1:length(colors)){
color = colors[i]
ft = get(paste0("ft_mod_",color))
gwas_enrichment_results[i,1] = color
gwas_enrichment_results[i,2] = ft$p.value
gwas_enrichment_results[i,3] = ft$estimate
}
# Computed FDR values, -log10(p-values), and view results
gwas_enrichment_results$p.adj = p.adjust(gwas_enrichment_results$p_value, method = "fdr", n = length(gwas_enrichment_results$p_value))
gwas_enrichment_results$neg_log10 = -log10(gwas_enrichment_results$p.adj)
gwas_enrichment_results %>% arrange(p.adj)
Plot the module enrichment and significance for GWAS genes
gwas_enrichment_results %>%
ggplot(aes(x = odds_ratio, y = neg_log10)) +
geom_point(aes(color = module_color), size = 3) +
geom_point(color = "black", size = 3, pch = 21) + theme_bw() +
geom_hline(yintercept = -log10(0.05), color = "red") +
scale_colour_identity()
Browse enriched modules
The annotated modules can be retrieved so the genes in the enriched modules can be browsed.
# Filtering annotated modules to browse one module
annotated_mods %>%
filter(mod_color == "tan")
Identify modules enriched for genes associated with monogenic bone disorders
In addition to identifying modules enriched for GWAS genes, modules enriched for genes known to cause related monogenic diseases can be identified
# First, the human monogenic gene IDs are converted to mouse IDs
mono_gene_hom <- homology %>%
filter(Symbol %in% monogenic_genes)
mono_mouse_hom <- homology %>%
filter(`HomoloGene ID` %in% mono_gene_hom$`HomoloGene ID`) %>%
filter(`NCBI Taxon ID` == 10090)
mono_mouse_genes = mono_mouse_hom$Symbol
# Then the loop is run to identify enriched modules
for (i in 1:length(colors)){
color = colors[i]
d = as.data.frame(get(paste0("mod_",color,"_trx_info")))
x = unique(d$external_gene_name) %in% mono_mouse_genes
x = sum(x == TRUE)
a <- x
b <- (40 - a)
c <- (dim(d)[1] - a)
e <- (29255 - a - b -c)
assign(paste0("ft_mod_",color), fisher.test(matrix(c(a,b,c,e),2,2,byrow=TRUE),
alternative='greater'))
}
# the results table is formatted
mono_results = as.data.frame(matrix(nrow=59,ncol=3))
colnames(mono_results) <- c('module_color','p_value','odds_ratio')
for (i in 1:length(colors)){
color = colors[i]
enrichment = get(paste0("ft_mod_",color))
mono_results[i,1] = color
mono_results[i,2] = enrichment$p.value
mono_results[i,3] = enrichment$estimate
}
mono_results$p.adj = p.adjust(mono_results$p_value, method = "fdr", n = length(mono_results$p_value))
mono_results$neg_log10 = -log10(mono_results$p.adj)
filter(mono_results, mono_results$p.adj < 0.05) %>%
arrange(desc(odds_ratio))
Plot the module enrichment and significance for monogenic disease genes
mono_results %>%
ggplot(aes(x = odds_ratio, y = neg_log10)) +
geom_point(aes(color = module_color), size = 3) +
geom_point(color = "black", size = 3, pch = 21) + theme_bw() +
geom_hline(yintercept = -log10(0.05), color = "red") +
scale_colour_identity()
Identify modules enriched for genes associated with bone phenotypes in model organisms
Finally, modules enriched for genes known to cause related phenotypes in model organisms are identified
# Then the loop is run to identify modules enriched for phenotype associated genes
for (i in 1:length(colors)){
color = colors[i]
d = as.data.frame(get(paste0("mod_",color,"_trx_info")))
x = unique(d$external_gene_name) %in% phen_genes
x = sum(x == TRUE)
a <- x
b <- (1088 - a)
c <- (dim(d)[1] - a)
e <- (29255 - a - b -c)
assign(paste0("ft_mod_",color), fisher.test(matrix(c(a,b,c,e),2,2,byrow=TRUE),
alternative='greater'))
}
# the results table is formatted
phen_results = as.data.frame(matrix(nrow=59,ncol=3))
colnames(phen_results) <- c('module_color','p_value','odds_ratio')
for (i in 1:length(colors)){
color = colors[i]
enrichment = get(paste0("ft_mod_",color))
phen_results[i,1] = color
phen_results[i,2] = enrichment$p.value
phen_results[i,3] = enrichment$estimate
}
phen_results = arrange(phen_results, desc(odds_ratio))
phen_results$p.adj = p.adjust(phen_results$p_value, method = "fdr", n = length(phen_results$p_value))
phen_results$neg_log10 = -log10(phen_results$p.adj)
filter(phen_results, phen_results$p.adj < 0.05)
Plot the module enrichment and significance for genes associated with bone phenotype from model systems
phen_results %>%
ggplot(aes(x = odds_ratio, y = neg_log10)) +
geom_point(aes(color = module_color), size = 3) +
geom_point(color = "black", size = 3, pch = 21) + theme_bw() +
geom_hline(yintercept = -log10(0.05), color = "red") +
scale_colour_identity()
(5) LD Score Regression
LD score regression (ldsc package) is used to calculated the partitioned heritability attributed to SNPs surrounding the genes that compose each module. The github repo and the wiki for the ldsc package has detailed tutorials for how to carry out this analysis. It is not wrapped for R, but can be run on the command line using python. The general steps taken are outlined here:
# (1) First, we need to format the GWAS summary statistics for input into the lsdc algorithm
## munge sumstats, returns a file called out.sumstats.gz
# ./munge_sumstats.py \
# --out BMD \
# --merge-alleles w_hm3.snplist \
# --a1-inc \
# --sumstats bmd_gwas_sumstats.txt
# (2) Generate genesets for each module for all chromosomes; this requires a list of gene identifiers for the genes in each module, a file indicated the coordinates for each identifier, and the plink .bim files of pre-computed LD scores. In this application we use the 1000 genomes plink files, the Genesets are listed as Ensembl gene IDs, and the gene coordinate file "ENSG_coord.txt", provided with the ldsc package, and we use a windowsize of 100,000, as recommended by the authors of the application
## make annotations for each module and each chromosome
# python ../../src/ldsc/make_annot.py --gene-set-file violet_module_human_gene_ids.Geneset --gene-coord-file ENSG_coord.txt --windowsize 100000 --bimfile ./1000G_plinkfiles/1000G.mac5eur.1.bim --annot-file ./violet_annot/violet_module.1.annot.gz
# (3) Finally, using all of these annotations, run ldsc using the processed summary statistics, the base annotation paths for the modules' annotations, the SNP weights and frequencies for the European 1000 Genomes data that are provided with the ldsc package, the overlap annotations was used because transcripts were used to generate the co-expression network, so the gene sets are non-disjoint, and finally, a base name for the output is provided.
## compute LDSC, outputs out.log and out.results
# python ldsc.py
# --h2 BMD.sumstats.gz\
# --ref-ld-chr antiquewhite4_module.,
# bisque4_module.,black_module.,
# blue_module., ...etc.\
# --w-ld-chr ./weights_hm3_no_hla/weights.
# --overlap-annot\
# --frqfile-chr 1000G.mac5eur.\
# --out BMD_all_modules_compare
### The output of the last step includes a log, which echos the command used to run the regression and provides a summary of the results, and a results table, which reports the proportion of SNPs, the proportion of heritability, the standard error fo the proportion of heritability, the heritability, the standard error of the heritability, and a p-value indicating the statistical significance of the enrichment
#ldsc_log = read_lines("./data/ld_score_regression/BMD_all_modules_compare.log")
#ldsc_results = read_tsv("./data/ld_score_regression/BMD_all_modules_compare.results")
# The results table has a category for each annotation provided, but they have generic labels. The log keeps track of the annotation input, so we can get the names from the log, because the order of the annotation matters
# Here is an example of table of results from an ldsc run arranged by p-value
url = "https://www.cell.com/cms/10.1016/j.celrep.2020.108145/attachment/921fc33d-d674-4488-951f-295320b24cea/mmc5.xlsx"
download.file(url = url, destfile = "./data/ld_score_regression/ld_score_reg_res.xlsx")
ldsc_res = as.data.frame(read_excel("./data/ld_score_regression/ld_score_reg_res.xlsx", sheet = 3))
# A p-value adjustment is also applied to correct for testing across all modules
ldsc_res$padj = p.adjust(ldsc_res$Enrichment_p, method = "fdr", n = length(ldsc_res$Enrichment_p))
ldsc_res %>%
dplyr::select(module, Enrichment, padj) %>%
filter(padj < 0.05)
(6) Colocalization Analysis
In this section, the coloc package is used to perform genetic colocalization analysis to determine whether eQTL for genes of interest from the core module identified above share common causal variants with QTL for the trait of interest.
Computing colocalization of GTEX eQTL for one gene in all tissues and trait QTL
# For this purpose, the GWAS summary statistics we loaded above are sufficient, however, we must reformat the gwas data for input into coloc
gwas_coloc = gwas_sumstat[c(2,5,6,11,9,7)]
gwas_coloc$MAF = ifelse(gwas_coloc$A1FREQ > 0.5, (1- gwas_coloc$A1FREQ), (gwas_coloc$A1FREQ))
# In addition to the summary statistics for the GWAS, the eQTL for the genes of interest will also need to be read in. These have been filtered from the full set of associations in the GTEx v7 eQTL studies using awk in the command line, e.g. > awk -F "\t" '$1 ~ /ENSG###/ {print}' .txt | awk -F "\t" '{ if(($3 >= lower_coord_limit) && ($3 <= upper_coord_limit)) { print } }' > output_file.txt. Do this for each tissue in GTEx. The tarball you need to download from GTEx is very large, so you will likely want to execute this on a server. The file can be downloaded with this command >wget https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL_all_associations.tar.gz
# We will also need to know about the number of samples used in the eQTL analysis for each tissue, so the key with this data need to be read in
tis = read_tsv("https://github.com/Farber-Lab/STAR_protocols_core_modules/raw/main/data/eqtl_data/tissue_key.txt")
# with these files in hand we can loop over all of the tissues and test colocalization in each tissue
gene_coloc_results = data.frame(matrix(NA, nrow = length(tis$Tissue), ncol = 8)) #empty table to fill with results for each tissue
for (i in 1:length(tis$Tissue)){
tissue = as.character(tis$Tissue[i])
z = read_tsv(url(paste0("https://github.com/Farber-Lab/STAR_protocols_core_modules/raw/main/data/eqtl_data/b4galnt3_snps/", tissue, "_b4galnt3_ebmd_snps.txt")), col_names = FALSE) # this will read in the eqtl file
# formatting
cols <- c("X2", "X3", "X4", "X5", "X6")
z$variant_id <- do.call(paste, c(z[cols], sep="_"))
z = z[,c(1,14,7,8,9,10,11,12,13)]
colnames(z) = c("gene_id", "variant_id", "tss_distance", "ma_samples", "ma_count", "maf",
"pval_nominal", "slope", "slope_se")
# add in RS_IDs
gene_snp_ids = GTExIdConvert(z$variant_id)
z = merge(z, gene_snp_ids, by = "variant_id")
tissue_n = as.numeric(tis[which(tis$Tissue == tissue),2])
# make coloc objects
gene.coloc = list(pvalues=as.numeric(z$pval_nominal),N=as.numeric(tissue_n),type='quant',
snp=as.character(z$rs_id), MAF=as.numeric(z$maf))
gwas.coloc = list(pvalues=as.numeric(gwas_coloc$P),N=142487,type='quant',
snp=as.character(gwas_coloc$SNP), MAF=as.numeric(gwas_coloc$MAF))
# run coloc
coloc_x = coloc.abf(gene.coloc,gwas.coloc)
# write out coloc results
gene_coloc_results[i,] = c(tissue,z$gene_id[1],coloc_x$summary[1],coloc_x$summary[2],coloc_x$summary[3],
coloc_x$summary[4],coloc_x$summary[5],coloc_x$summary[6])
}
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.12e-12 2.56e-06 8.28e-07 1.00e+00 2.97e-06
## [1] "PP abf for shared variant: 0.000297%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 4.78e-14 2.56e-06 1.86e-08 1.00e+00 6.45e-08
## [1] "PP abf for shared variant: 6.45e-06%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.21e-06 2.24e-07 8.63e-01 8.74e-02 4.99e-02
## [1] "PP abf for shared variant: 4.99%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 1.28e-13 1.99e-08 4.97e-08 6.78e-03 9.93e-01
## [1] "PP abf for shared variant: 99.3%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 5.85e-58 1.90e-08 2.28e-52 6.41e-03 9.94e-01
## [1] "PP abf for shared variant: 99.4%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.19e-06 2.56e-07 8.55e-01 9.97e-02 4.49e-02
## [1] "PP abf for shared variant: 4.49%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.31e-06 1.50e-07 9.00e-01 5.83e-02 4.15e-02
## [1] "PP abf for shared variant: 4.15%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 1.91e-06 2.72e-07 7.44e-01 1.06e-01 1.50e-01
## [1] "PP abf for shared variant: 15%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 1.16e-06 8.44e-07 4.53e-01 3.29e-01 2.18e-01
## [1] "PP abf for shared variant: 21.8%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.20e-06 2.25e-07 8.57e-01 8.75e-02 5.54e-02
## [1] "PP abf for shared variant: 5.54%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.04e-06 3.96e-07 7.94e-01 1.54e-01 5.15e-02
## [1] "PP abf for shared variant: 5.15%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 1.76e-06 1.48e-07 6.85e-01 5.75e-02 2.58e-01
## [1] "PP abf for shared variant: 25.8%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.11e-06 3.18e-07 8.23e-01 1.24e-01 5.27e-02
## [1] "PP abf for shared variant: 5.27%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.07e-06 2.07e-07 8.06e-01 8.07e-02 1.14e-01
## [1] "PP abf for shared variant: 11.4%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.02e-06 3.21e-07 7.87e-01 1.25e-01 8.78e-02
## [1] "PP abf for shared variant: 8.78%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.27e-06 1.75e-07 8.84e-01 6.81e-02 4.82e-02
## [1] "PP abf for shared variant: 4.82%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.12e-06 2.17e-07 8.28e-01 8.47e-02 8.70e-02
## [1] "PP abf for shared variant: 8.7%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.27e-06 1.68e-07 8.85e-01 6.53e-02 4.99e-02
## [1] "PP abf for shared variant: 4.99%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.06e-06 1.79e-07 8.02e-01 6.95e-02 1.28e-01
## [1] "PP abf for shared variant: 12.8%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 4.48e-16 2.56e-06 1.74e-10 1.00e+00 1.00e-08
## [1] "PP abf for shared variant: 1e-06%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.20e-06 1.86e-07 8.59e-01 7.23e-02 6.83e-02
## [1] "PP abf for shared variant: 6.83%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.21e-06 2.28e-07 8.60e-01 8.90e-02 5.06e-02
## [1] "PP abf for shared variant: 5.06%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 1.27e-06 1.57e-07 4.95e-01 6.08e-02 4.44e-01
## [1] "PP abf for shared variant: 44.4%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.07e-06 2.74e-07 8.05e-01 1.07e-01 8.83e-02
## [1] "PP abf for shared variant: 8.83%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 3.14e-07 2.22e-06 1.22e-01 8.65e-01 1.29e-02
## [1] "PP abf for shared variant: 1.29%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 1.82e-06 4.32e-07 7.10e-01 1.68e-01 1.21e-01
## [1] "PP abf for shared variant: 12.1%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 1.66e-06 7.85e-07 6.48e-01 3.06e-01 4.57e-02
## [1] "PP abf for shared variant: 4.57%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 1.89e-06 1.81e-07 7.36e-01 7.02e-02 1.94e-01
## [1] "PP abf for shared variant: 19.4%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.18e-06 2.14e-07 8.51e-01 8.32e-02 6.54e-02
## [1] "PP abf for shared variant: 6.54%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.15e-06 2.68e-07 8.38e-01 1.05e-01 5.73e-02
## [1] "PP abf for shared variant: 5.73%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 1.39e-06 3.17e-07 5.42e-01 1.23e-01 3.35e-01
## [1] "PP abf for shared variant: 33.5%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.19e-06 2.33e-07 8.52e-01 9.09e-02 5.68e-02
## [1] "PP abf for shared variant: 5.68%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.93e-09 2.48e-06 1.14e-03 9.65e-01 3.35e-02
## [1] "PP abf for shared variant: 3.35%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.38e-08 2.50e-08 9.30e-03 8.75e-03 9.82e-01
## [1] "PP abf for shared variant: 98.2%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.12e-06 1.81e-07 8.28e-01 7.06e-02 1.02e-01
## [1] "PP abf for shared variant: 10.2%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.17e-06 1.96e-07 8.47e-01 7.63e-02 7.67e-02
## [1] "PP abf for shared variant: 7.67%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.21e-06 1.88e-07 8.60e-01 7.31e-02 6.72e-02
## [1] "PP abf for shared variant: 6.72%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 1.84e-06 4.65e-07 7.16e-01 1.81e-01 1.03e-01
## [1] "PP abf for shared variant: 10.3%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.09e-06 3.33e-07 8.13e-01 1.30e-01 5.69e-02
## [1] "PP abf for shared variant: 5.69%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 1.88e-06 1.81e-07 7.32e-01 7.03e-02 1.98e-01
## [1] "PP abf for shared variant: 19.8%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.55e-16 2.56e-06 9.93e-11 1.00e+00 9.97e-09
## [1] "PP abf for shared variant: 9.97e-07%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.22e-06 1.80e-07 8.66e-01 7.01e-02 6.40e-02
## [1] "PP abf for shared variant: 6.4%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 4.39e-07 2.09e-06 1.71e-01 8.17e-01 1.21e-02
## [1] "PP abf for shared variant: 1.21%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 1.52e-06 3.64e-07 5.95e-01 1.42e-01 2.64e-01
## [1] "PP abf for shared variant: 26.4%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.15e-06 1.88e-07 8.39e-01 7.32e-02 8.78e-02
## [1] "PP abf for shared variant: 8.78%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.19e-06 1.99e-07 8.53e-01 7.75e-02 6.93e-02
## [1] "PP abf for shared variant: 6.93%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 5.34e-34 2.56e-06 2.08e-28 1.00e+00 9.96e-09
## [1] "PP abf for shared variant: 9.96e-07%"
Looking at the coloc output
# take a look at the results, format them, and then look at the significant ones
gene_coloc_results
colnames(gene_coloc_results) = c("tissue", "gene", "nsnps", "PP.H0.abf", "PP.H1.abf", "PP.H2.abf", "PP.H3.abf", "PP.H4.abf")
subset(gene_coloc_results, PP.H4.abf > 0.50)
Plotting interesting coloc results using RACER package
For interesting coloc results, we can produce a mirror plot, mapping the GWAS summary statistics and the eQTL summary statistics onto the same genome coordinates for comparison
# Choosing a tissue with an interesting coloc result
tissue = "Artery_Coronary"
z = read_tsv(url(paste0("https://github.com/Farber-Lab/STAR_protocols_core_modules/raw/main/data/eqtl_data/b4galnt3_snps/", tissue, "_b4galnt3_ebmd_snps.txt")), col_names = FALSE)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## X1 = col_character(),
## X2 = col_double(),
## X3 = col_double(),
## X4 = col_character(),
## X5 = col_character(),
## X6 = col_character(),
## X7 = col_double(),
## X8 = col_double(),
## X9 = col_double(),
## X10 = col_double(),
## X11 = col_double(),
## X12 = col_double(),
## X13 = col_double()
## )
cols <- c("X2", "X3", "X4", "X5", "X6")
z$variant_id <- do.call(paste, c(z[cols], sep="_"))
z = z[,c(1,14,7,8,9,10,11,12,13)]
colnames(z) = c("gene_id", "variant_id", "tss_distance", "ma_samples", "ma_count", "maf",
"pval_nominal", "slope", "slope_se")
gene_snp_ids = GTExIdConvert(z$variant_id)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## X1 = col_character(),
## X2 = col_character()
## )
## [1] 1
## [1] 1
z = merge(z, gene_snp_ids, by = "variant_id")
eqtl_b4galnt3 = separate(z, variant_id, into = c("chr", "pos", "ref", "alt", "build"), sep = "_")
# Filtering GWAS data for plot
gwas_b4galnt3 = gwas_sumstat %>%
filter(CHR == as.numeric(eqtl_b4galnt3$chr[1])) %>%
filter(BP > min(eqtl_b4galnt3$pos)) %>%
filter(BP < max(eqtl_b4galnt3$pos))
# Using RACER package to generate mirror plot
gwas_b4galnt3_formatted = formatRACER(assoc_data = gwas_b4galnt3, chr_col = "CHR", pos_col = "BP", p_col = "P", rs_col = "SNP")
## Formating association data...
## Processing -log10(p-values)...
## Preparing association data...
eqtl_b4galnt3_formatter = formatRACER(assoc_data = eqtl_b4galnt3, chr_col = "chr", pos_col = "pos", p_col = "pval_nominal", rs_col = "rs_id")
## Formating association data...
## Processing -log10(p-values)...
## Preparing association data...
gwas_b4galnt3_ld = ldRACER(assoc_data = gwas_b4galnt3_formatted, rs_col = "RS_ID", pops = "EUR", lead_snp = "rs215226")
## All inputs are go!
## Reading in association data...
## Populations selected.
## Calculating LD using rs215226...
## [1] "https://ldlink.nci.nih.gov/LDlinkRest/ldproxy?var=rs215226&pop=EUR&r2_d=r2&token=c0f613f149ab"
## Merging input association data with LD...
eqtl_b4galnt3_ld = ldRACER(assoc_data = eqtl_b4galnt3_formatter, rs_col = "RS_ID", pops = "EUR", lead_snp = "rs215226")
## All inputs are go!
## Reading in association data...
## Populations selected.
## Calculating LD using rs215226...
## [1] "https://ldlink.nci.nih.gov/LDlinkRest/ldproxy?var=rs215226&pop=EUR&r2_d=r2&token=c0f613f149ab"
## Merging input association data with LD...
mirrorPlotRACER(assoc_data1 = eqtl_b4galnt3_ld, assoc_data2 = gwas_b4galnt3_ld, chr = 12, name1 = "B4GALNT3 eQTL", name2 = "BMD GWAS", plotby = "coord", start_plot = 500000, end_plot = 700000, label_lead = TRUE)
## Plotting by...
## coord
## Reading in association data
## Generating plot.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Warning: Removed 1 row(s) containing missing values (geom_path).
(7) PhenStat Analysis
While the colocalization analysis provides evidence supporting a relationship between network identified genes and a trait of interest, a causal relationship can only be demonstrated through controlled perturbation of a target and direct measurement of the phenotype of interest. While the hypotheses here can lead to a novel set of experiments, databases of experimental perturbations and measured phenotypes can be mined for evidence supporting a causal relationship between a gene and a phenotype of interest. For example, the International Mouse Phenotyping Consortium has a database of phenotypes measured in 7022 strains of knockout mice (IMPC release 12.0). - search for gene on main page https://www.mousephenotype.org/ - go to gene page https://www.mousephenotype.org/data/genes/MGI:3041155 - click all data table, search for relevant phenotype - click into a relevant phenotype https://www.mousephenotype.org/data/charts?accession=MGI:3041155&allele_accession_id=MGI:4434237&pipeline_stable_id=ESLIM_001&procedure_stable_id=ESLIM_005_001¶meter_stable_id=ESLIM_005_001_004&zygosity=homozygote&phenotyping_center=ICS - Right click and copy the link to download the “PhenStat-ready raw experiment data” - Copy the url into the url defition in the code chunk below
# Download the file
url = 'https://www.mousephenotype.org/data/exportraw?phenotyping_center=ICS¶meter_stable_id=ESLIM_005_001_004&allele_accession_id=MGI:4434237&strain=MGI:2164831&pipeline_stable_id=ESLIM_001&&zygosity=homozygote&'
dataset1 = data.table::fread(url)
# Change the date format, strip only the columns we need to do the statistical comparison
dataset1 = as.data.frame(dataset1)
dataset1 = dataset1[,c(15,16,21,26,28)]
dataset1$Assay.Date = as.character(dataset1$Assay.Date)
dataset1
Next, we create a test object for PhenStat that will go into the statistical test
test<-PhenList(dataset1,
testGenotype="EPD0140_5_G02",
dataset.colname.batch = "Assay.Date",
dataset.values.male="male",
dataset.values.female="female",
dataset.clean=TRUE,
outputMessages = TRUE)
## Information:
## Dataset's 'Genotype' column has following values: '+/+', 'EPD0140_5_G02'
## Information:
## Dataset's 'Sex' column has following value(s): 'Female', 'Male'
test@datasetPL
PhenStat:::getStat(test)
## Variables Numeric Continuous Levels NObs Mean StdDev Minimum
## 1 Batch FALSE FALSE 104 920 NA NA NA
## 2 Genotype FALSE FALSE 2 920 NA NA NA
## 3 Sex FALSE FALSE 2 920 NA NA NA
## 4 Value TRUE TRUE 909 920 0.04464975 0.003652078 0.03730429
## 5 Weight TRUE TRUE 130 920 24.60663043 2.951539971 17.50000000
## Maximum
## 1 NA
## 2 NA
## 3 NA
## 4 0.06429247
## 5 32.70000000
Next, use testDataset to test for differences in a dependant variable, here "Value", which is the BMD measurement. The program will choose whether to keep specific model effect, and in this case it corrects for batch, weight, and sex, but not an interaction term. Additionally, it does not detect a difference variances.
results_MM_bmd = testDataset(test, depVariable = "Value")
## Information:
## Dependent variable: 'Value'.
## Information:
## Perform all MM framework stages: startModel and finalModel.
## Information:
## Method: Mixed Model framework.
## Information:
## Equation: 'withWeight'.
## Information:
## Calculated values for model effects are: keepBatch=TRUE, keepEqualVariance=TRUE, keepWeight=TRUE, keepSex=TRUE, keepInteraction=FALSE.
Use summary output to view the results of the statistical test for comparing BMD. You can see that there is an effect of genotype, there is no sexual dimorphism, and the effect of weight in the model was significant.
summaryOutput(results_MM_bmd)
##
## Test for dependent variable:
## *** Value ***
##
## Method:
## *** Mixed Model framework ***
## ----------------------------------------------------------------------------
## Model Output
## ----------------------------------------------------------------------------
## Final fitted model: Value ~ Genotype + Sex + Weight
## Was batch significant? TRUE
## Was variance equal? TRUE
## Genotype p-value: 5.241566e-02
## Genotype effect: -1.957718e-03 +/- 1.008667e-03
## Was there evidence of sexual dimorphism? no (p-value 8.495508e-01)
## Genotype percentage change Female: -4.38%
## Genotype percentage change Male: -4.38%
##
## ----------------------------------------------------------------------------
## Classification Tag
## ----------------------------------------------------------------------------
## With phenotype threshold value 0.01 - no significant change
##
## ----------------------------------------------------------------------------
## Model Output Summary
## ----------------------------------------------------------------------------
## Value Std.Error DF t-value p-value
## (Intercept) 0.0239276920 1.301252e-03 813 18.388211 2.074091e-63
## GenotypeEPD0140_5_G02 -0.0019577183 1.008667e-03 813 -1.940897 5.261613e-02
## SexMale -0.0010626637 3.414265e-04 813 -3.112423 1.920506e-03
## Weight 0.0008632046 5.786314e-05 813 14.918039 1.158340e-44
Finally, we can create a boxplot of the differences in the value between genotypes for both sexes.
boxplotSexGenotype(test,
depVariable = "Value",
graphingName = "Bone Mineral Density",
outputMessages = T)
PhenStat:::boxplotSexGenotypeBatchAdjusted(test, depVariable = "Value")
## Information:
## Value variable is adjusted treating batch as a random effect.